15  Exploring Data with R

To recap what we learnt in the previous sessions.. we now know to work within the R Project environment. here::here() makes it easy for us to manage file paths. You can quickly have a look at your data using the View() and glimpse() functions. Most of the tidy data is read as tibble which is a workhorse of tidyverse.

It is here::here() is better than setwd()

here::here() allows us to filepaths very easily

15.1 Getting Started with the Data Exploration Pipeline

15.1.1 Set-up

#install.packages("pacman")


pacman::p_load(tidyverse, here)

#tidyverse required for tidy workflows
#rio required for importing and exporting data
#here required for managing file paths

Note

The shortcut for code commenting is Ctrl+Shift+C.

15.1.2 Load Data

The dataset we will be working with has been cleaned (to an extent) for the purposes of this workshop. It is a dataset about NHANES that has been took from the NHANES and cleaned up and modified for our use.

# Check the file path
here::here("data", "nhanes_modified_df.rds")
[1] "D:/research_methods_analysis/rmda_book/data/nhanes_modified_df.rds"
# Read Data
df <- read_rds(here("data", "nhanes_modified_df.rds"))

Try the following functions using tb as the argument:

  • glimpse()
  • head()
  • names()

Now, we will be introducing you to two new packages:

  1. dplyr
  2. skimr
  3. DataExplorer

15.2 dplyr Package

The dplyr is a powerful R-package to manipulate, clean and summarize unstructured data. In short, it makes data exploration and data manipulation easy and fast in R.

There are many verbs in dplyr that are useful, some of them are given here…

Important functions of the dplyr package to remember

Syntax structure of the dplyr verb

15.2.1 Getting used to the pipe |> or %>%

The pipe operator in dplyr

Note

The pipe |> means THEN…

The pipe is an operator in R that allows you to chain together functions in dplyr.

Let’s find the bottom 50 rows of tb without and with the pipe.

Tips The native pipe |> is preferred.

#without the pipe
tail(df, n = 50)

#with the pipe
df |> tail(n = 50)

Now let’s see what the code looks like if we need 2 functions. Find the unique age in the bottom 50 rows of df

#without the pipe
unique(tail(df, n = 50)$age)

# with the pipe
df |> 
  tail(50) |>
  distinct(age)

Note

The shortcut for the pipe is Ctrl+Shift+M

You will notice that we used different functions to complete our task. The code without the pipe uses functions from base R while the code with the pipe uses a mixture (tail() from base R and distinct() from dplyr). Not all functions work with the pipe, but we will usually opt for those that do when we have a choice.

15.2.2 distinct() and count()

The distinct() function will return the distinct values of a column, while count() provides both the distinct values of a column and then number of times each value shows up. The following example investigates the different race (race) in the df dataset:

df |> 
  distinct(race) 

df |> 
  count(race)

Notice that there is a new column produced by the count function called n.

15.2.3 arrange()

The arrange() function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the desc() function to arrange by descending.

The following code would get the number of times each race is in the dataset:

df |> 
  count(race) |> 
  arrange(n)

# Since the default is ascending order, 
# we are not getting the results that are probably useful, 
# so let's use the desc() function
df |> 
  count(race) |> 
  arrange(desc(n))

# shortcut for desc() is -
df |> 
  count(race) |> 
  arrange(-n)

15.2.4 filter()

If you want to return rows of the data where some criteria are met, use the filter() function. This is how we subset in the tidyverse. (Base R function is subset())

Here are the logical criteria in R:

  • ==: Equal to
  • !=: Not equal to
  • >: Greater than
  • >=: Greater than or equal to
  • <: Less than
  • <=: Less than or equal to

If you want to satisfy all of multiple conditions, you can use the “and” operator, &.

The “or” operator | (the vertical pipe character, shift-backslash) will return a subset that meet any of the conditions.

Let’s see all the data for age 60 or above

df |> 
  filter(age >= 60)

Let’s just see data for white

df |> 
  filter(race == "White")

Both White and age 60 or more

df |> 
  filter(age >= 60 & race == "White")

Lets create a column called ind_sub if the country is in the Indian Subcontinent.

# use our vector indian_subcont that we created before
tb |> 
  mutate(indian_sub1 = if_else(country %in% indian_subcont, 
                              "Indian Subcontinent", "Others"))

The same thing can be done using case_when().

tb |> 
  mutate(indian_sub2 = case_when(country %in% 
                                   indian_subcont ~ 
                                   "Indian Subcontinent",
                           TRUE ~ "Other")) 

Lets do it again, but this time let us make it 1 and 0, 1 if it is a country in the Indian Subcontinent, 0 if otherwise.

tb |> 
  mutate(indian_sub3 = case_when(country %in% indian_subcont ~ 1,
                           TRUE ~ 0))

Note

The if_else() function may result in slightly shorter code if you only need to code for 2 options. For more options, nested if_else() statements become hard to read and could result in mismatched parentheses so case_when() will be a more elegant solution.

As a second example of case_when(), let’s say we wanted to create a new population variable that is low, medium, or high.

See the pop broken into 3 equally sized portions

quantile(tb$pop, prob = c(.33, .66))

Note

See the help file for quanile function or type ?quantile in the console.

We’ll say:

  • low pop = 2043237 or less
  • med pop = between 2043237 and 11379155
  • high pop = above 11379155
tb |> 
  mutate(popcat = case_when(pop <= 2043237 ~ "low",
                            pop > 2043237 & pop <= 11379155 ~ "med",
                            TRUE ~ "high"))

15.2.5 join()

Typically in a data science or data analysis project one would have to work with many sources of data. The researcher must be able to combine multiple datasets to answer the questions he or she is interested in. Collectively, these multiple tables of data area called relational data because more than the individual datasets, its the relations that are more important.

As with the other dplyr verbs, there are different families of verbs that are designed to work with relational data and one of the most commonly used family of verbs are the mutating joins.

Different type of joins, represented by a series of Venn Diagram

These include:

  • left_join(x, y) which combines all columns in data frame x with those in data frame y but only retains rows from x.

  • right_join(x, y) also keeps all columns but operates in the opposite direction, returning only rows from y.

  • full_join(x, y) combines all columns of x with all columns of y and retains all rows from both data frames.

  • inner_join(x, y) combines all columns present in either x or y but only retains rows that are present in both data frames.

  • anti_join(x, y) returns the columns from x only and retains rows of x that are not present in y.

  • anti_join(y, x) returns the columns from y only and retains rows of y that are not present in x.

Visual representation of the join() family of verbs

Apart from specifying the data frames to be joined, we also need to specify the key column(s) that is to be used for joining the data. Key columns are specified with the by argument, e.g. inner_join(x, y, by = "subject_id") adds columns of y to x for all rows where the values of the “subject_id” column (present in each data frame) match. If the name of the key column is different in both the dataframes, e.g. “subject_id” in x and “subj_id” in y, then you have to specify both names using by = c("subject_id" = "subj_id").

15.2.6 pivot()

:::{style=“text-align:justify”}

Most often, when working with our data we may have to reshape our data from long format to wide format and back. We can use the pivot family of functions to achieve this task. What we mean by “the shape of our data” is how the values are distributed across rows or columns. Here’s a visual representation of the same data in two different shapes:

Long and Wide format of our data

  • “Long” format is where we have a column for each of the types of things we measured or recorded in our data. In other words, each variable has its own column.

  • “Wide” format occurs when we have data relating to the same measured thing in different columns. In this case, we have values related to our “metric” spread across multiple columns (a column each for a year).

Let us now use the pivot functions to reshape the data in practice. The two pivot functions are:

  • pivot_wider(): from long to wide format.
  • pivot_longer(): from wide to long format.

:::{style=“text-align:justify”}

Resources for learning more dplyr

  • Check out the Data Wrangling cheatsheet that covers dplyr and tidyr functions.(https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

  • Review the Tibbles chapter of the excellent, free R for Data Science book.(https://r4ds.had.co.nz/tibbles.html)

  • Check out the Transformations chapter to learn more about the dplyr package. Note that this chapter also uses the graphing package ggplot2 which we have covered yesterday.(https://r4ds.had.co.nz/transform.html)

  • Check out the Relational Data chapter to learn more about the joins.(https://r4ds.had.co.nz/relational-data.html)

Exercise


  1. In 2007, which 10 countries had the highest incidence_100k?

  2. Within the South East Asia who_region, which countries have incidence_100K > 300?

  3. How many countries are in each who_region? Put the output in order from lowest to highest number of countries. Hint: use distinct() and arrange()

  4. Which country in which year has the highest incidence of tuberculosis?

  5. Excluding missing values on hiv_incidence_100k, what is the correlation coefficient for the relationship between the tb incidence_100k and the hiv_incidence_100k for each region. Show the output with the highest correlations first. Hint: use cor()

15.3 skimr Package

skimr is designed to provide summary statistics about variables in data frames, tibbles, data tables and vectors. The core function of skimr is the skim() function, which is designed to work with (grouped) data frames, and will try coerce other objects to data frames if possible.

Give skim() a try.

tb |> skimr::skim()

Check out the names of the output of skimr

tb |> skimr::skim() |> names()

Also works with dplyr verbs

tb |> group_by(who_region) |> skimr::skim()
tb |> skimr::skim() |> 
  dplyr::select(skim_type, skim_variable, n_missing)

15.4 DataExplorer Package

The DataExplorer package aims to automate most of data handling and visualization, so that users could focus on studying the data and extracting insights.1

The single most important function from the DataExplorer package is create_report()

Try it for yourself.

library(DataExplorer)
create_report(tb)

Exercise

  1. Create a report on Exploratory Data Analysis using RMarkdown

  2. Use DataExplorer to explore the given data


  1. DataExplorer Package↩︎